Automatic Language-Independent Induction of Gazetteer Lists
نویسندگان
چکیده
Adaptation of existing Information Extraction (IE) systems to new languages and domains is the focus of much current research, but progress is often hindered by the lack of available resources to enable developers to get a new system up and running fast. It has previously been shown that a good set of gazetteer lists can have a vital role here, but creation of lists for a new language or domain can be time-consuming and laborious. In this paper we demonstrate a tool for inducing gazetteer lists from a small set of annotated corpora and creating a baseline IE system. We also describe an extension to this, using bootstrapping techniques in order to generate much larger volumes of noisy training texts. High quality results have been achieved in this way on Hindi, Chinese and Arabic.
منابع مشابه
Bootstrapping Named Entity Recognition with Automatically Generated Gazetteer Lists
Current Named Entity Recognition systems suffer from the lack of hand-tagged data as well as degradation when moving to other domain. This paper explores two aspects: the automatic generation of gazetteer lists from unlabeled data; and the building of a Named Entity Recognition system with labeled and unlabeled data.
متن کاملGazetteer Preparation for Named Entity Recognition in Indian Languages
This paper describes our approaches for the preparation of gazetteers for named entity recognition (NER) in Indian languages. We have described two methodologies for the preparation of gazetteers1. Since the relevant gazetteer lists are more easily available in English we have used a transliteration based approach to convert available English name lists to Indian languages. The second approach ...
متن کاملN-gram and Gazetteer List Based Named Entity Recognition for Urdu: A Scarce Resourced Language
Extraction of named entities (NEs) from the text is an important operation in many natural language processing applications like information extraction, question answering, machine translation etc. Since early 1990s the researchers have taken greater interest in this field and a lot of work has been done regarding Named Entity Recognition (NER) in different languages of the world. Unfortunately...
متن کاملNamed Entity Recognition in Hindi using Maximum Entropy and Transliteration
(NER) system becomes challenging if proper resources are not available. Gazetteer lists are often used for the development of NER systems. In many resource-poor languages gazetteer lists of proper size are not available, but sometimes relevant lists are available in English. Proper transliteration makes the English lists useful in the NER tasks for such languages. In this paper, we have describ...
متن کاملA Novel Approach to Automatic Gazetteer Generation using Wikipedia
Gazetteers or entity dictionaries are important knowledge resources for solving a wide range of NLP problems, such as entity extraction. We introduce a novel method to automatically generate gazetteers from seed lists using an external knowledge resource, the Wikipedia. Unlike previous methods, our method exploits the rich content and various structural elements of Wikipedia, and does not rely ...
متن کامل